Go site 2210 gorule 0000027 must check dbs are in the db xref file #677

@mugitty mugitty commented Jun 24, 2024

No description provided.

@mugitty mugitty requested a review from dustine32 June 24, 2024 18:04
@dustine32
Collaborator

Linking to geneontology/go-site#2210

@@ -235,6 +239,8 @@ def parse_line(self, line):
references = self.validate_curie_ids(assoc.evidence.has_supporting_reference, split_line)
Collaborator

@mugitty Is there any overlap of logic between self.validate_curie_ids here and self._validate_curie_using_db_xrefs just below? Could self._validate_curie_using_db_xrefs be incorporated into self.validate_curie_ids?

Collaborator Author

@dustine32, validate_curie_ids calls _validate_id, which checks that an identifier has the "DB:id" form and, for annotations, that the prefix is in the GO id space. However, it does not validate the identifier against the syntax pattern specified in the db-xrefs file. The new check is meant as a catch-all for any identifier in the GAF line that has a database field and an id_syntax.
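
To illustrate the distinction drawn here, the following is a minimal sketch rather than the ontobio implementation. It assumes a hypothetical mapping from database name to id_syntax regexes. The patterns and the function names `split_curie` and `matches_id_syntax` are illustrative only. A doubled prefix passes the "DB:id" shape check but fails the pattern check.

```python
import re

# Hypothetical excerpt of db-xrefs metadata: database name -> id_syntax regexes.
# These patterns are illustrative and not copied from the real db-xrefs file.
ID_SYNTAX = {
    "PMID": [re.compile(r"[0-9]+")],
    "GO_REF": [re.compile(r"[0-9]{7}")],
}


def split_curie(curie):
    """Shape check only: split 'DB:local_id' at the first colon.

    Returns (prefix, local_id), or None if either part is missing.
    """
    prefix, sep, local_id = curie.partition(":")
    if not sep or not prefix or not local_id:
        return None
    return prefix, local_id


def matches_id_syntax(curie, id_syntax=ID_SYNTAX):
    """Pattern check: the local id must fully match a regex for its database."""
    parts = split_curie(curie)
    if parts is None:
        return False
    prefix, local_id = parts
    patterns = id_syntax.get(prefix)
    if patterns is None:
        # Database name is not listed in the (sample) metadata.
        return False
    return any(p.fullmatch(local_id) for p in patterns)
```

With this sketch, `split_curie("PMID:PMID:14561399")` still yields a prefix and a local id, so a shape-only check accepts it, while `matches_id_syntax` rejects it because the local id is not purely numeric.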

Collaborator

@dustine32 dustine32 Jun 24, 2024

@mugitty Would there be any issue if you incorporated this db-xrefs syntax pattern checking inside _validate_id? It looks like _validate_id assumes every id is a CURIE, which by definition should always have a colon-separated database field and id_syntax (else an error will be reported). The only complication I can think of is if the metadata/db-xrefs is not supplied when _validate_id is called, but you could just make the check optional in _validate_id (i.e., only run it if self.config.db_type_name_regex_id_syntax is not None).

I guess my main concern is separating validation logic throughout the code that really should be in the same place.
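
The suggestion above can be sketched as a single validation function in which the pattern check is skipped when no metadata is supplied. This is a hedged illustration, not the actual _validate_id: the function name `validate_id`, the shape of the `id_syntax` argument (database name mapped to a list of compiled patterns), and the message wording are all assumptions made for the example.

```python
import re
from typing import Dict, List, Optional, Pattern


def validate_id(
    curie: str,
    id_syntax: Optional[Dict[str, List[Pattern]]] = None,
) -> List[str]:
    """Return a list of problem messages; an empty list means the id passed.

    id_syntax stands in for the optional db-xrefs metadata. Its structure
    here is an assumption for this sketch.
    """
    prefix, sep, local_id = curie.partition(":")
    if not sep or not prefix or not local_id:
        return [f"{curie} is not a well-formed CURIE"]

    if id_syntax is None:
        # No db-xrefs metadata supplied: keep only the shape check.
        return []

    if prefix not in id_syntax:
        return [f"{prefix} not found in list of database names"]

    if not any(p.fullmatch(local_id) for p in id_syntax[prefix]):
        return [f"{local_id} does not match any id_syntax patterns for {prefix}"]

    return []
```

Keeping both checks in one function means callers that have no metadata behave as before, while callers that do supply it get the stricter validation from the same entry point.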

Collaborator Author

@dustine32, as you suggested, I could incorporate this into _validate_id and add a check that self.config.db_type_name_regex_id_syntax is not None.

Let me update.

@dustine32
Collaborator

@mugitty Do we already have a test for gorule 0000027 somewhere?

@mugitty
Collaborator Author

mugitty commented Jun 25, 2024

@dustine32, I pushed updates that fold the id syntax check into the _validate_id method.

@dustine32
Collaborator

@mugitty Awesome, thank you so much! Do we have a test for wrong IDs that should return that "does not match any id_syntax patterns" warning message? It would be good to confirm your new code can be triggered by this.

@mugitty
Collaborator Author

mugitty commented Jun 25, 2024

I ran it against https://github.com/geneontology/go-site/blob/master/docs/gorules_test_errors.gaf. The output error.json file contains entries such as:
...
{
  "level": "WARNING",
  "line": "MGI:1100518\tSmad7\tbla\tinvolved_in\tGO:0017015\tMGI:MGI:3836072|PMID:18952608\tIC\tGO:0060389\tP\tGORULE_TEST:0000020-3\tSMAD\tprotein_coding_gene\ttaxon:10090\t20090211\tGO_Central\t\n",
  "type": "Invalid identifier",
  "message": "GORULE:0000027: 3836072 does not match any id_syntax patterns for MGI in dbxrefs",
  "obj": "MGI:MGI:3836072",
  "taxon": "NCBITaxon:10090",
  "rule": 27
},
{
  "level": "WARNING",
  "line": "UniPotKB\tQ9HC96\tCAPN10\tinvolved_in\tGO:0006921\tPMID:23072806\tIDA\t\tP\tGORULE_TEST:0000027-1 Calpain-10\tCAPN10,KIAA1845\tprotein\ttaxon:9606\t20140213\tGO_Central\t\n",
  "type": "Invalid identifier",
  "message": "GORULE:0000027: UniPotKB not found in list of database names in dbxrefs",
  "obj": "UniPotKB:Q9HC96",
  "taxon": "NCBITaxon:9606",
  "rule": 27
},
{
  "level": "WARNING",
  "line": "UniProtKB\tQ9HC96\tCAPN10\tinvolved_in\tGO:0006921\tPMID:PMID:14561399\tIDA\t\tP\tGORULE_TEST:0000027-3 Calpain-10\tCAPN10,KIAA1845\tprotein\ttaxon:9606\t20140213\tGO_Central\t\n",
  "type": "Invalid identifier",
  "message": "GORULE:0000027: PMID:14561399 does not match any id_syntax patterns for PMID in dbxrefs",
  "obj": "PMID:PMID:14561399",
  "taxon": "NCBITaxon:9606",
  "rule": 27
},
...

This update outputs warnings. Groups can update the id_syntax and/or their ids based on these warnings.

@dustine32
Collaborator

@mugitty Thanks again! It's great that you confirmed the go-site checks will run correctly with this change but usually we also test this type of functionality within the ontobio/tests. As a comparable example, I'm looking at the test for GoRule43 and it recreates a small part of the gorefs metadata to test the GO_REF validation functionality. Could we do the same thing in test_qc.py (i.e., add a test_gorule27 function) with a small sample of the db-xrefs metadata?
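
As a rough illustration of the kind of test requested here, the sketch below builds a tiny in-memory sample of db-xrefs-style metadata and asserts against it. It does not use the ontobio API. The field names in `SAMPLE_DB_XREFS`, the helpers `build_id_syntax_map` and `id_matches`, and the regex patterns are assumptions made for the example, and the real test would exercise the parser instead.

```python
import re

# Small in-memory stand-in for parsed db-xrefs metadata. The field names
# (database, entity_types, type_name, id_syntax) are assumed for this sketch,
# and the patterns are illustrative.
SAMPLE_DB_XREFS = [
    {"database": "PMID",
     "entity_types": [{"type_name": "entity", "id_syntax": "[0-9]+"}]},
    {"database": "UniProtKB",
     "entity_types": [{"type_name": "protein", "id_syntax": "[A-Z0-9]{6,10}"}]},
]


def build_id_syntax_map(db_xrefs):
    """Map database -> {type_name: compiled regex}; skip types without id_syntax."""
    result = {}
    for entry in db_xrefs:
        by_type = {}
        for entity_type in entry.get("entity_types", []):
            if "id_syntax" in entity_type:
                by_type[entity_type["type_name"]] = re.compile(entity_type["id_syntax"])
        result[entry["database"]] = by_type
    return result


def id_matches(id_syntax_map, database, local_id):
    """True if local_id fully matches any pattern registered for database."""
    patterns = id_syntax_map.get(database, {}).values()
    return any(p.fullmatch(local_id) for p in patterns)


def test_gorule27_sketch():
    id_syntax_map = build_id_syntax_map(SAMPLE_DB_XREFS)
    # Well-formed ids for known databases pass.
    assert id_matches(id_syntax_map, "UniProtKB", "Q9HC96")
    assert id_matches(id_syntax_map, "PMID", "14561399")
    # A doubled prefix leaves a local id that fails the pattern.
    assert not id_matches(id_syntax_map, "PMID", "PMID:14561399")
    # A misspelled database name is absent from the metadata.
    assert "UniPotKB" not in id_syntax_map
```

The three negative and positive cases mirror the kinds of input seen in gorules_test_errors.gaf: a valid id, a doubled prefix, and a misspelled database name.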

@mugitty
Collaborator Author

mugitty commented Jun 26, 2024

@dustine32, since these are validation tests, I added tests to both test_gafparser.py and test_gpad_parser.py

@dustine32 dustine32 self-requested a review July 8, 2024 18:18
Collaborator

@dustine32 dustine32 left a comment

@mugitty Thanks so much for your patience and thanks for adding the tests! This all looks good now.

@mugitty mugitty merged commit 85280f3 into master Jul 8, 2024
3 checks passed